Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices
Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts to develop alternatives have focused on a small number of hand-crafted structured matrices, and have neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, and Monarch, along with many novel structures. We develop a taxonomy of all such operators based on their computational and algebraic properties, which provides insights into their scaling laws. Combining these insights with empirical evaluation, we identify a subset of structures that achieve equal or better performance than dense layers as a function of training compute. To further improve their compute efficiency, we develop a natural extension of these performant structures that converts them into a sparse Mixture-of-Experts layer. The resulting layer significantly outperforms dense layers in compute-optimal training efficiency for GPT-2 language models.
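To make the idea of a linear operator "expressible via an Einstein summation" concrete, the following minimal sketch writes a dense layer, a low-rank layer, and a Kronecker-structured layer as single `numpy.einsum` calls. It is an illustration only and not the parameterization used in this work: the dimensions, the factor names (`U`, `V`, `A`, `B`), and the particular factorization of the input and output axes are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, rank = 3, 6, 4, 2  # illustrative sizes (assumed)

x = rng.standard_normal((batch, d_in))

# Dense layer: y[b, o] = sum_i x[b, i] * W[i, o]
W = rng.standard_normal((d_in, d_out))
y_dense = np.einsum("bi,io->bo", x, W)

# Low-rank layer: W is replaced by the product of two thin factors U and V.
U = rng.standard_normal((d_in, rank))
V = rng.standard_normal((rank, d_out))
y_low_rank = np.einsum("bi,ir,ro->bo", x, U, V)

# Kronecker layer: W = A (x) B, with d_in = 2 * 3 and d_out = 2 * 2.
# The input axis is reshaped into two axes, each contracted with one factor.
A = rng.standard_normal((2, 2))  # maps input axis 1 (size 2) to output axis 1 (size 2)
B = rng.standard_normal((3, 2))  # maps input axis 2 (size 3) to output axis 2 (size 2)
x_reshaped = x.reshape(batch, 2, 3)
y_kron = np.einsum("bij,ik,jl->bkl", x_reshaped, A, B).reshape(batch, d_out)

# Parameter counts for each structure at these sizes.
n_dense = W.size
n_low_rank = U.size + V.size
n_kron = A.size + B.size
```

Under these assumptions, each einsum output matches the product of `x` with the corresponding explicitly materialized matrix (`U @ V` or `np.kron(A, B)`), while the structured variants store fewer parameters than the dense matrix.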
A Experiment Details and Complete Results
A.2 Model Architectures
In this section we describe in detail each of the model architectures we use in our experiments.
Our small ConvNet consists of the following layers:
A convolutional layer with 32 kernels of size 3 × 3 and ReLU activation.
A max pooling layer with pool size 2 × 2.
A flatten layer.
For inputs of shape 32 × 32 × 3, this model has 21,697 parameters.
Our large ConvNet model consists of the following layers:
A convolutional layer with 32 kernels of size 3 × 3, padding, and ReLU activation.